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By David Siegmund^, Benjamin Yakir^ and Nancy R. Zhang^ 

Stanford University, Hebrew University of Jerusalem and Stanford 

University 

Given a set of aligned sequences of independent noisy observa- 
tions, we are concerned with detecting intervals where the mean val- 
ues of the observations change simultaneously in a subset of the se- 
quences. The intervals of changed means are typically short relative 
to the length of the sequences, the subset where the change occurs, 
the "carriers," can be relatively small, and the sizes of the changes 
can vary from one sequence to another. This problem is motivated by 
the scientific problem of detecting inherited copy number variants in 
aligned DNA samples. We suggest a statistic based on the assump- 
tion that for any given interval of changed means there is a given 
fraction of samples that carry the change. We derive an analytic ap- 
proximation for the false positive error probability of a scan, which 
is shown by simulations to be reasonably accurate. We show that 
the new method usually improves on methods that analyze a single 
sample at a time and on our earlier multi-sample method, which is 
most efficient when the carriers form a large fraction of the set of 
sequences. The proposed procedure is also shown to be robust with 
respect to the assumed fraction of carriers of the changes. 

1. Introduction. This paper is motivated by the problem of detecting 
inherited DNA copy number variants (CNV). CNV are gains and losses 
of segments of chromosomes, and comprise an important class of genetic 
variation in human populations. Various laboratory techniques have been 
developed to measure DNA copy number [Pinkel et al. (1998); Pollack et 
al. (1999); Snijders et al. (2001); Bigneh et al. (2004); Peiffer et al. (2006)]. 
These measurements are taken at a set of probes, each mapping to a specific 
location in the genome. The data thus produced are a set of linear sequences 
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of measurement intensities, one for each biological sample in the study. If 
a sample contains a CNV at a particular genomic region, then depending 
on whether the CNV is a gain or loss, the intensities increase or decrease 
relative to their average values in that region. 

Studies of DNA copy number arise in two distinct contexts, which yield 
data with different characteristics. One of these is cancer genetics, where 
somatic changes in DNA copy number occur in the genomes of tumor cells. 
[See Pinkel and Albertson (Pinkel and Albertson, 2005) for a review.] These 
changes can be quite long, sometimes involving entire chromosomes or chro- 
mosomal arms. The second context, which motivates the problem formula- 
tion in this paper, involves inherited regions of CNV. These are population 
polymorphisms. As such, they hypothetically could be functional variants 
contributing to phenotypic variability, and hence are of interest in associa- 
tion studies. Alternatively, they can be neutral markers for tracing distant 
relationships in populations, which could be used in population genetics. 
Since inherited regions of CNV are typically quite short, often covering only 
one or a few probes, they are more difficult to detect in individual genomes 
than their tumor counterparts, which has led some investigators to place 
a minimal length of 2-10 probes on a CNV [e.g., Redon et al. (Redon et al., 
2006), McCarroll et al. (McCarroll et al., 2008), Walsh et al. (Walsh et al, 
2008)] even though this restriction artificially eliminates many candidates 
from contention. An illustrative segment of CNV data from a group of nor- 
mal samples are shown in the form of a heatmap in Figure 1. Each row of 
the heatmap is a sample, and each column is a probe. The probes map to 
ordered locations along a chromosome. For illustration, the region depicted 
in Figure 1 contains a CNV between probes 1800 and 1900 that is visibly 
apparent as stretches of high (red) or low (blue) intensities in a few of the 
samples. Note that the breakpoints are shared across samples, and that the 
shift in mean may be positive for some individuals and negative in others. 

Most current procedures process the samples one at a time in the detection 
of CNV. For recent reviews, see Lai et al. (Lai et al., 2005), Willenbrock and 
Fridlyand (Willenbrock and Fridlyand, 2005), and Zhang (Zhang, 2010). Lai 
et al. and Willenbrock and Fridlyand compare many of the existing methods 
on a common data set. In this paper we take the view that since these CNV 
are population level polymorphisms, there is the possibility to pool data 
across individuals (samples) to boost the power of detection of simultaneous 
changes occurring in a fraction of the sequences. See Zhang et al. (Zhang et 
al., 2010) for more scientific background and additional references. 

Following Olshen et al. (Olshen et al., 2004), we formulate this problem 
as one of detecting intervals where the mean of a sequence of independent 
random variables shows a change from its baseline, that is, overall mean, 
value. Zhang et al. (Zhang et al., 2010) extended the approach of Olshen et 
al. to the case of multiple aligned sequences and the problem of detecting 
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Fig. 1. An example segment of DNA copy number data. Each row is a sample, and each 
column IS a probe. Gains and losses m copy number manifest as stretches of low or high 
intensities. 



intervals of change that occur at identical locations in some of the sequences. 
They proposed a sum of chi-squares statistic, which is effectively the likeli- 
hood ratio statistic assuming normal errors, and showed that a simultaneous 
scan of all sequences for a shared signal across multiple profiles can improve 
power compared to a method that separately segments each individual se- 
quence, especially if a moderate to large fraction of the sequences "carry" 
the change. [The methods of Olshen et al. (Olshen et al., 2004) and Zhang 
et al. (Zhang et al., 2010) are reviewed in more detail in Section 2.2.] 

Since the sum of chi-squares statistic was designed for the situation where 
a moderate to large fraction of the sequences carry a change, it can have low 
power to detect the many CNV that are rare variants, where the fraction of 
carriers is less than ~5%. The accurate detection of rare variants is becom- 
ing increasingly important, due to the recent interest in association studies 
targeting rare variants [cf. the review by McCarroll (McCarroll, 2008)]. Al- 
though Zhang et al. (Zhang et al., 2010) also suggested a class of "weighted" 
statistics to detect rare variants, the method they used to approximate p- 
values for the sum of chi-squares statistic relies on the spherical symmetry 
of the standard multivariate normal distribution, and does not adapt to the 
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more general scan statistics considered in this paper. Our main theoreti- 
cal result is a more general method to approximate the false positive rate 
for a wide class of multi-sample scan statistics, which includes the sum of 
chi-squares statistic as a special case. We show by simulations that the ap- 
proximations are quite accurate. This allows us to assess the significance of 
genome-wide studies, which often involve over a million probes and thou- 
sands of samples. Simulations and other computer intensive methods are 
very difficult to implement for scans of such large data sets. 

In Section 2 we formulate the basic model and suggest a class of statis- 
tics based on the assumption of a mixture of mean levels at each variant 
interval. Next we generalize the method introduced by Siegmund, Yakir and 
Zhang (2010) to provide analytic approximations to the false positive rates 
of these statistics, and we use Monte Carlo experiments to show that the 
approximations are very accurate. In Section 4 we compare different statis- 
tics and illustrate the benefits of pooling information across samples, even 
in the case where the proportion of carriers is very low. Section 5 contains 
a test case involving actual CNV data. Section 6 contains a discussion, and 
in Appendix A we sketch a proof of our false positive rate approximations. 

The independence and normality assumptions made in this paper also 
underlie most previous approaches to this problem. Raw data from popu- 
lar genotyping microarray platforms often deviate from these assumptions, 
but most of this deviation can be eliminated by appropriate normalization 
procedures. A description of data preprocessing is given in Section 5. 

We consider here the primary problem to be detection of the intervals of 
change. In many cases, the carriers, that is, the subset of samples where the 
changes have occurred, are relatively obvious from inspection of the data af- 
ter the intervals have been reported. In other cases, determining the carriers 
poses a difficult auxiliary problem, because of the very large dimension of 
the parameter space. Zhang et al. (Zhang et al., 2010) suggested a simple ad 
hoc thresholding algorithm. We expect to discuss in the future more system- 
atic criteria that involve modeling of probe-specific effects, clustering across 
samples, and a generalization to multiple sequences of the BIC method of 
Zhang and Siegmund (Zhang and Siegmund, 2007). 

For data from some platforms (e.g., the SNP genotyping arrays from 
Affymetrix and Illumina), other information, such as A and B allele frequen- 
cies, is available to improve the accuracy of CNV detection. Some methods 
[Wang et al. (2007); Colella et al. (2007)] use a Hidden Markov model to 
detect CNV based on both the total intensity and the allele specific data. 
While Colella et al. (2007) mentioned that their hidden Markov model can 
be extended to process multiple samples simultaneously, no convincing evi- 
dence was presented that the allele specific analysis, when combined across 
samples, improves accuracy. The reason, at least for the Affymetrix plat- 
form, is that allele specific frequencies are also prone to artifacts and can be 
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much noisier than total intensity data. While effective measures for artifact 
removal for total intensity data have been developed (see Section 5) and 
allow successful cross-sample integration, appropriate measures appear to 
be lacking for normalization of allele specific frequencies. Although meth- 
ods based on allele specific data undoubtedly have a role to play in CNV 
detection, in this paper we focus on the integration of total intensity data 
across samples, which admits an appealingly simple and general model that 
appears to be more generally useful. 

Remark. Although the formulation and results in this paper have been 
motivated by problems associated with detection of CNV, the multisample 
change-point model that we study may be useful in quite different contexts. 
One of current interest is sequential detection of a change-point by a dis- 
tributed array of sensors [e.g., Tartakovsky and Polunchenko (Tartakovsky 
and Polunchenko, 2007)], where our p- value approximation can be used as 
the starting point to develop an approximation to the average run length 
when there is no change-point. Another example is briefly described in the 
Appendix. 

2. Change-point models and scan statistics. 

2.1. Problem formulation. The observed data is a two-dimensional array 
{Vit ■i^i^N,l<t< T}, where yu is the data point for the ith profile at 
location t, N is the total number of profiles, and T is the total number 
of locations. In genome-wide profiling studies, is usually in the tens to 
the thousands, and T is usually in the hundreds of thousands. We assume 
that for each i, the random variables = {ya - l < t < T} are mutually 
independent and Gaussian with mean values fin and variances af. Under 
the null hypothesis, the means for each profile are identical across locations. 
Under the alternative hypothesis of a single changed interval, there exist 
values I < Ti < T2 <T and a set of profiles ^7 C {1, . . . ,N}, such that for 

(2.1) fJ'it = lJ-i + SiI{Tl<t<T2}^ 

where the 6i are nonzero constants and /ij is the baseline mean level for pro- 
file i, which may not necessarily be known in advance. Under the alternative 
hypothesis we refer to (ri , T2] as a variant interval and as the set of carri- 
ers, that is, the subset of samples that have a changed mean in that interval. 
If the alternative hypothesis is true, we are interested primarily in detecting 
this situation and in estimating the endpoints of the variant interval, and 
secondarily in determining the carriers. 

In DNA copy number data, the magnitude of change in signal intensity 
varies across samples for any given CNV, even when the underlying change 
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in copy number is the same. This is due to differences in sample handling, 
and motivates the assignment of a new 5i parameter to each carrier; see 
Zhang et al. (Zhang et al., 2010) for examples. 

In many applications, including CNV detection, there are usually multiple 
variant intervals defined by different ri, T2 and J. We describe the model 
and statistics assuming the simple case where there is at most one variant 
interval. If the number of intervals is small and the intervals are widely 
spaced, a single application will detect multiple intervals. More generally, 
these statistics can be combined with the recursive segmentation algorithm 
in Zhang et al. (Zhang et al., 2010) to treat the case where there are multiple 
variant intervals. 

2.2. Review of scan statistics. First we review the case of a single se- 
quence of observations. Initially we suppress the dependence of our notation 
on the profile indicator i. For {yi, . . . , yx}, let St = yi + ■ • ■ + yt, yt = St/t, 
and (7^ = Ylri{yt~yTY the maximum likelihood estimate of variance. 
Olshen et al. (Olshen et al., 2004) used likelihood ratio based statistics for 
analysis of DNA copy number data for a single sequence. The statistic they 
suggested was 

(2.2) max[/2(s,t), 

s,t 

where 

(2.3) C/(s, t) = a-'{St -Ss-{t- s)yT}/[it - s){l - {t - s)/T}]^/\ 

and the max is taken over 1 < s < t < T,t — s < Ti. Here Ti < T is an 
assumed upper bound on the length of the variant interval, which for some 
applications may be much smaller than T. 

If the error standard deviation a were known and used in place of a in 

(2.3) , (2.2) would be the likelihood ratio statistic. The denominator in (2.2) 
standardizes the variance of the numerator, and under the null hypothesis of 
no change, U'^{s,t) is asymptotically distributed as Xi- practice, a must 
be estimated. Since T is usually very large in typical applications, we shall 
for theoretical developments treat a as known. Then, we can without loss 
of generality set a = 1. 

For data involving N sequences, to test the null hypothesis Hq that 6i=0 
for all 1 <i < N versus the alternative that for some values of ri < T2 
at least some 5i are not zero, Zhang et al. (Zhang et al., 2010) proposed 
a direct generalization of (2.2): 

N 

(2.4) maxZ(s,t) where Z(s, t) = V L/f(s, t) 

s<t ^ ^ 

i=l 
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and Ui{s,t) is the sequence specific statistic defined in (2.3) for the ith 
sequence. Again, if the variances are known, (2.4) is the generahzed log hke- 
Uhood ratio statistic for testing Hq versus Ha- For each fixed s <t, the nuh 
distribution of Z{s,t) is approximately with degrees of freedom. Even 
if the samples are related (say, replicates or members of the same family), 
this relatedness only matters under the alternative hypothesis that there is 
a CNV. Thus, even for related samples, as long as they are independent un- 
der the null hypothesis, the null distribution of Z[s,t) would be Xn- Large 
values of Z{s,t) are evidence against the null hypothesis. If the null hy- 
pothesis is rejected, the maximum likelihood estimate of the location of the 
variant interval is {s*,t*) = argmax^ ^ Z(s, t). 

2.3. Mixture model. Whereas conducting a separate analysis for each 
individual sequence requires that each sample show strong evidence for the 
detection of a variant interval, the sum of statistic goes to the other 
extreme of favoring situations where many samples have relatively weak 
evidence. For cases where N is moderately large, say, in the 100s or even 
1000s, it seems reasonable to consider intermediate statistics that require 
each sample to show moderate evidence before they are allowed to make 
a substantial contribution to the overall statistic. 

Consider again the problem as originally formulated, where JT" denotes the 
set of samples containing the same variant interval, and let Qi{s,t) denote 
the indicator that i € J' and that the aligned change-points are s, t. If Qi{s, t) 
were observed, the generalized log-likelihood ratio statistic, maximizing over 
the individual jumpsizes {6i :i = 1, . . . , N}, would be 

N 

max V log[{l - Qi{s, t)} + Qi{s, t)e^?^'^'y'] 

s,t ^ ' 

1=1 

= maxVQi(s,t)C/2(s,i)/2. 

s,t — ' 

1=1 

Since Qi{s,t) is not observed, we have considered two surrogate statistics. If 
we assume that po G [0, 1) is a prior probability that Qi{s,t) = 1, we could 
consider the left-hand side of (2.5) with pQ substituted for Qi{s,t), that is, 

AT 

(2.6) max Vlog[l - Po +Poe^»'("'*)/^]. 

s,t ^ ' 

i=l 

This is the mixture likelihood ratio statistic. We could also consider the 
posterior distribution of Qi{s,t), given the data, which depends on the un- 
known parameters of the problem. But if we maximize with respect to these 
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unknown parameters, we get 

N 

(2.7) mejcY.'^p,[Ui{s,t)]Ui{s,t)/2, 

' 1=1 

where 

(2.8) Wp^{x) = exp{x/2)/{rp^ +exp(j;/2)}, 

and = (1 —po)/po denotes the prior odds against the indicated hypoth- 
esis. We call this the weighted sum of chi-squares statistic. 

Both the mixture likelihood ratio statistic and the weighted sum of the 
chi-squares statistic are of the form of a maximum over s <t of random fields 
of the form g{Ui{s,t)], where 5 is a suitable function. In Section 3 we 
give an approximation for the false positive rate of such a maximum for 
general smooth functions g. The statistics we consider are all two-sided, and 
can be considered to be transformations of the statistic Uf{s,t). The 
transformation U"^ — >■ log[(l —po) -|-po6xp(C/^/2)] for the mixture likelihood 
and U"^ — )• Wpf^{U'^)U'^ for the weighted both effectively soft-threshold 
the statistics, decreasing small values toward zero. Figure 2 shows these 
transformations compared to the identity transformation for the sum of chi- 
squares statistic. The new statistics depend on the choice of po; with small 
values of pq requiring a more substantial apparent signal from a given se- 
quence of observations before that sequence is allowed to make an important 
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contribution to the overall statistic. For pQ = l, both recover the sum of the 
chi-squares statistic. See Figure 2. 

Remarks, (i) As we shall see in the power analyses of Section 4, these 
statistics are relatively robust with respect to the choice of pq. Consequently, 
we have not considered an adaptive or data driven method for estimating po- 
(ii) Our original preference was for the weighted sum of the chi-squares 
statistic, since the heuristic argument behind this statistic suggests that it 
will adapt better to the data than the mixture likelihood ratio. Our numer- 
ical experiments indicate, however, that the two statistics behave similarly, 
with the mixture likelihood ratio being more stable and often slightly more 
powerful. Hence, we report numerical results only for the mixture likelihood 
ratio statistic. 

3. Approximations for the significance level. For scan statistics of the 
form described above, we now give an analytic approximation for the signifi- 
cance level that accounts for the simultaneous testing of multiple dependent 
hypotheses. The approximation gives a fast and computationally simple way 
of controlling the false positive rate. 

As described in Section 2, we assume that the data is a matrix of indepen- 
dent, identically distributed random variables yi^t with mean zero, variance 
one and sufficiently small tails. Each row represents a process and there 
are N such processes. Given a starting point s and an interval length r, let 
JJ = {t : s <t < s + t} be a window of integers. Over this window construct, 
for each process, the sum Wi{JJ) = ^teJ-^ Vi,t consider the standardized 
statistic 



which again has mean zero and variance one. Let 5 be a smooth, positive 
(nonlinear) function and consider the statistic G'(JJ) = ^f=ig{Zi{Jl)]. For 
example, g{x) = log[(l —po) +po6xp(x^/2)] for the mixture likelihood ratio 
statistic. We are interested in the approximation of 



for A^, To, Ti and x diverging to +00 at the same rate. 

In applying the above formulation to (2.6) and (2.7), we have already as- 
sumed T is so large that the standard deviations can be estimated without 
error. To simplify the derivation, we also assume the baseline mean values 
can be estimated without error. At least for normally distributed variables 
and Ti <^T (the case of interest here), this assumption does not change 
the final approximation, and the required changes are straightforward oth- 
erwise. Hence, fn and di are treated below as known constants and Zi{Jl) is 



Z,{Jl)=T 



(3.1) 




max max 

s<T To<T<Ti 




10 



D. SIEGMUND, B. YAKIR AND N. R. ZHANG 



equivalent to f/j(s,s + r). When dealing with smaller (but still large) sam- 
ples, variation in the estimates of baseline parameters can be handled by 
modifications of the same method. 

To state our approximation, which involves an exponential change of mea- 
sure, we define the log moment generating function 

i;r{0)=logEexp{eg{Zr)}, 

where Zj- is a convenient notation for a random variable having the distri- 
bution of the Zi{JJ). Now choose Or to satisfy tpriGr) = x/N, and let 

(3.2) ^i{e) = le^ [ [5(z)]2e^3(^)-^^(^V(^) dz, 



where ip is the standard Gaussian density. 

Then, provided that T is subexponential in N, the probability in (3.1) is 
asymptotically equivalent to 

^ (T - r)e-^^^-^-(^^)-'^-(^-»{2^iV^,(0,)}-i/2 
T=To 

(3.3) 

xe-'f,\er){N/rfu\[2f,{er){N/T)]'/% 

where to a very good approximation 

i.(x) ^ [{2/x)Mx/2) - l/2}]/{(x/2)$(x/2) +<^(x/2)} 

[cf. Siegmund and Yakir (2007)]. For the case of central interest in this 
paper, the yij are standard normal, so ipr does not depend on r. Hence, 
several factors in (3.3) can be moved in front of the sum; and the sum of 
the remaining terms can be approximated by an integral, to obtain 

(3.4) 

/T 

xe-^i?{e) [ ' i^\[2Nfi{e)/{Tt)]^/^){i-t)dt/t^. 

Jto/T 

Remark, (i) For the sum of the chi-squares statistic, g{x) = x^, and (3.4) 
is essentially the same as the approximation in Zhang et al. (Zhang et al., 
2010) except that — 1 has been replaced by in two places, (ii) Although 
the derivation of (3.4) requires that Tq — )■ oo, by an auxiliary argument one 
can show in the normal case that the approximation remains valid for arbi- 
trarily small Tq, in particular, for Tq = 1. 
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Table 1 

Accuracy of approximate thresholds: The statistic is the mixture chi-square with 
parameters N = 100, To = l,Ti = 50, T = 500. The number of repetitions of the Monte 
Carlo experiment is 1000. Results in parentheses are thresholds in units of standard 

deviations above the mean 



Po 


Significance level 


Th (approx.) 


Th (MC) 


0.03 


0.10 


16.2 


15.3 


0.03 


0.05 


17.1 (8.7) 


16.8 


0.03 


0.01 


19.1 


19.2 


0.1 


0.10 


27.4 


26.3 


0.1 


0.05 


28.5 (6.64) 


28.6 


0.1 


0.01 


30.9 


31.3 


1.0 


0.10 


84.1 


83.9 


1.0 


0.05 


85.9 (5.08) 


85.8 


1.0 


0.01 


89.8 


99.8 



3.1. Accuracy of the approximation in the normal case. In this section 
we report a Monte Carlo experiment to verify the accuracy of the suggested 
approximations for normally distributed data. In Table 1 we consider the 
mixture likelihood ratio and give significance thresholds based on simulation 
and on the approximation (3.4). It seems difficult to develop intuition about 
the magnitude of these thresholds, so in a few cases we have also included in 
parentheses the thresholds measured in units of standard deviations above 
the mean. However, it does not seem substantially easier to develop intuition 
in this scale. The corresponding threshold for a single normally distributed 
sequence would be 4.3, so we see that in this scale the tail of the distribu- 
tion gets heavier with decreasing as one would expect. While the results 
in Table 1 indicate that the approximation is quite accurate, the param- 
eters A^, Ti and T are all relatively small, since the simulations become 
very time consuming for larger values. A second example is given in the 
Appendix. 

4. Power comparisons. For the statistic max^^T- G(JJ), when the variant 
interval is (ti, ri -|- r2], we consider as an approximation to the power of the 
probability 

v{G{j;i)>h}, 

where b is the threshold computed to achieve a pre-chosen significance level, 
say, 0.05. This probability is a lower bound on the true power, which also 
involves the much smaller probability that G{JJ) < b for s = ti,t = T2, but 
exceeds b for nearby s,t. This simple approximation can be evaluated us- 
ing a normal approximation or a small and fast Monte Carlo experiment 
involving only T2 x N observations. 
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We conducted a power analysis for detecting CNV using the AfFymetrix 
6.0 microarray platform, which contains ~1.8 million probes. We assumed 
that a separate scan is conducted for each chromosome. The average number 
of probes per chromosome is around 80,000, and, thus, as a rough approx- 
imation, we set the total length of a scan to be T = 80,000. We restricted 
our attention to the detection of short CNV, and, thus, we enforced a max- 
imum window size of Tq = 1000. We considered the detection of single copy 
insertions and deletions, and assumed the signal to noise ratios (SNR) are 
between 1 and 3. These are comparable to the signal to noise ratios of actual 
data sets. For example, for the Hapmap data set obtained from Affymetrix, 
we computed the signal to noise ratios of those CNV detected in Zhang et al. 
(Zhang, Senbabaoglu and Li, 2010) that are confirmed by fosmid sequencing 
data. We found that the signal to noise ratios for one copy gains are around 
1.5-3.5 and that for one copy losses are around 2-3.5. These SNR are higher 
than true signal to noise ratios, since only those regions with stronger sig- 
nals were detected. The false positive rate is controlled at 0.05/23 = 0.0022, 
which corrects for the multiple testing across chromosomes by the Bonferroni 
inequality. 

Figure 3 shows the power of detection of a CNV of length L that is 
present in a fraction p € {0.01,0.02,0.05,0.1} of the cohort, using the scan 
statistic (2.6) with a range of values for pQ. The size of the cohort is set 
to be 100 or 1000. The signal to noise ratio is 2 in the left column, and 1 in 
the right column. For each setting, Bonferroni corrected single-sample scans 
are compared to multi-sample scans. 

A few observations are worth noting from Figure 3. First, when N = 100 
and p = 0.01, that is, when only one out of 100 samples carries a change, 
a single sample scan has slightly greater power than a multi-sample scan 
using a small value of pq. In this case, using the sum of the chi-squares 
statistic {po = 1) can have very low power, which is expected. When the 
signal is present in only one sample, pooling across samples should not result 
in a gain of power. When the true fraction p is increased to 0.02, that 
is, only 2 out of 100 samples carry a change, then a multi-sample scan 
gives a substantial boost in power for po ^ 0.1. Furthermore, when the true 
fraction is p = 0.1, a multi-sample scan with any value of po £ (0.01,0.2) 
does better than a single sample scan. These results also indicate that for p 
not too small, the results for different assumed values of po are comparable. 

Regarding the range of the horizontal axes in Figure 3, note that for 
N = 100 and signal to noise ratio of 2, the range of interval lengths where 
we can expect a noticeable boost in power is typically less than about 10-12. 
For longer CNV, the power is already close to 1, so multi-sample scans do 
not give added benefit. Note also that if the signal to noise ratio is divided 
by / and the length of the interval is multiplied by the marginal power 
is unchanged. For example, if the signal to noise ratio is changed to 1, that 
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Fig. 3. Each plot shows the detection power versus length of CNV for a given setting of 
sample size N , signal to noise ratio (SNR) and fraction of carriers p. Going down each 
column, p increases while N and SNR are fixed. N — 100, SNR = 2 for the left column, 
and N = 1000, SNR = 1 for the right column. The different curves represent the mixture 
scan statistic (2.6) for different values ofpo, with the solid triangles representing the single 
sequence scan (see legend at bottom right). 
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is, 1/2 as large, the noticeable boost in power occurs for intervals up to four 
times as long, or about 40-50. 

A surprising observation is that, for the range of signal-to- noise ratios 
and interval lengths that seem relevant to the current microarray platforms, 
scan statistics using a small value of pq seem to be the winner under a wide 
range of conditions. Even when the true fraction of carriers is a moderate 
sized p = 0.1, using pQ = 0.01 gives almost the same power as po = 0.1 for 
most CNV lengths. The benefit in using a large value of is more noticeable 
when the signal to noise ratio is small while N and the percentage of carriers 
is large, as expected. (Results under a wider set of conditions are available 
in supplementary materials.) 

5. Validation on a biological data set. In Zhang et al. (Zhang et al., 2010) 
we illustrated our results on data obtained with a set of 62 Illumina 550K 
Beadchips from experiments performed on DNA samples extracted from 
lymphoblastoid cell lines derived from healthy individuals. These data were 
used recently as part of the quality assessment panel at the Stanford Human 
Genome Center (i.e., they were collected prior to studies of scientific interest 
to diagnose possible problems in the experimental protocol). The 62 samples 
are useful for method assessment because they represent 10 sets of (child, 
parent, parent) trios and 16 technical replicates of 16 independent DNA 
samples. We withhold the relation between samples during the detection 
process, so that the scanning algorithm is blind to this information, and 
use it afterward for validation. In Zhang et al. (Zhang et al., 2010) we used 
these data to demonstrate the improvement of multi-sample scans based on 
the sum of the chi-squares statistic over single sample scans. Here we make 
a similar comparison of the sum of the chi-squares statistic with the mixture 
likelihood ratio statistic. 

Data from most microarray based experiments exhibit various artifacts, 
including strong local trends, first documented in Olshen et al. (Olshen et 
al., 2004) and studied in detail in Diskin et al. (Diskin et al., 2008). Diskin 
et al. (Diskin et al., 2008) showed an association of these trends with lo- 
cal GC content, and proposed a regression-based method that reduced the 
magnitude of the local trends. Another problem for microarray-based experi- 
ments is that the noise variance varies significantly across probes, causing the 
bulk distribution of the intensities for each sample to deviate from normal. 
Such inhomogeneity of variances prompted Purdom and Holmes (Purdom 
and Holmes, 2005) to use a Laplace distribution, which can be derived from 
a mixture of normals with different variances, to model gene expression data. 

Before applying the cross sample scan, we preprocessed the data so that 
the assumptions of independence and normality can be valid. We adopted 
the following approach (let x = {xu -.i = 1, . . . , N;t = 1, . . . ,T} be the raw 
data): 
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1. Each sample is standardized to its median, that is, 

x-j = Xit — median(xjf :t = 1, . . . ,T). 

Let x' be the matrix of x-^ values obtained in this way. 

2. Let L be the rank-1 singular value decomposition (SVD) of x' , and let 

x" = x'-L. 

3. Standardize each SNP to have the same 84% and 16% quantiles as the 
standard normal distribution, that is, 

yit = x'-t/dt, 

where dt = {qt 84 — Qtie)/^, where qt z is the zth. quantile of {x'l^:i = 
1,...,N}. 

Empirically, we found that the rank-1 SVD of x' in step 2 effectively captures 
experimental artifacts such as local trends. This is because experimental 
artifacts can be viewed as a low-rank perturbation of the data. For example, 
Diskin et al. (Diskin et al., 2008) showed that local trends can be explained 
by a linear model using one predominant factor, the local GC content. In 
our data, we found that the rank-1 SVD can eliminate local trends more 
completely than the genomic waves software of Diskin et al., possibly because 
the local GC content is not accurately computed or because local GC content 
does not completely control for the artifacts. If the magnitude of the signal 
(i.e., the CNV regions) is large compared to the magnitude of artifacts, then 
parts of the signal would also be captured by the SVD and dampened in 
step 2. However, in normal DNA samples, the CNV regions are short and well 
separated. Thus, compared to the sparse signal, artifacts overwhelmingly 
contribute to the total data variation and almost completely determine the 
rank-1 SVD. Finally, standardizing the quantiles of each SNP in step 3 makes 
the assumption of normal errors with homogeneous variance not too far from 
the truth. 

Figure 4 shows the normal qq-plot and the autocorrelation plot for one of 
the 62 samples after this normalization procedure. The qq-plot shows that 
the bulk of the data now looks convincingly normal (the tails are heavier 
than normal due to CNV regions), with the adherence to normality more 
evident when we zoom in to a region that is visually confirmed to contain 
no CNVs [Figure 4(b)]. From Figure 4(c) we see that there is no detectable 
autocorrelation in the normalized data. 

To assess detection accuracy, we compare CNV identified for the two 
technical replicates of the same individual, and also compare those identified 
for the child with those identified for the parents. We define "inconsistency" 
of detections of CNV in individual samples as follows: (1) If a detected CNV 
in one of the replicate pairs is not detected in the second sample of the pair. 
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Normal Q-Q Plot: Sample 1 SNPS 2000-3000 on Chromosome 4 Autocorrelation of sample 1 
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(a) (b) (c) 

Fig. 4. Normal qq-plot [{a) and (b)/ and autocorrelation plot (c) for sample 1 of the 
Stanford Quality Control Panel data, after preprocessing. The qq-plot in (a) compares the 
distribution over all of the SNPs (on all chromosomes) for this sample against the standard 
normal distribution. The qq-plot in (b) zooms in on SNPs 2000-3000 on chromosome 4, 
which does not contain any visually identifiable CNVs. 

the CNV is considered inconsistent. (2) If a detected CNV in the child is not 
detected in at least one of the parents, it is considered inconsistent. Detection 
accuracy is thus assessed by plotting the number of total versus inconsistent 
detections, and different methods can be compared in such a plot. See Zhang 
et al. (Zhang et al., 2010) for a more complete discussion. 

This method of accuracy assessment requires the identification of the 
carriers of each CNV, and the method of identification affects the level of 
consistency. For example, if all of the samples are classified as "changed" at 
all CNV locations, then there would be many detections but no inconsisten- 
cies. In Zhang et al. (Zhang et al., 2010) we developed an empirically based 
thresholding method, which we use again here. 

Figure 5 shows the results for different settings of the parameters po and 
the sample detection thresholds. The horizontal axis is the number of total 
detections and the vertical axis is the number of inconsistent detections. 
Each line in the graph represents a different setting for po, and dots on the 
line refer to performance at varying values of a threshold parameter sug- 
gested in Zhang et al. (Zhang et al., 2010): 5j^lN' absolute difference in 
medians between the readings inside and outside the interval for a sequence 
to be called a carrier of a CNV. Within the range of 0.2-0.4, as (Jj^j^ de- 
creases, the size of the set declared to be carriers, as well as the number of 
inconsistencies, increases. 

Zhang et al. (Zhang et al., 2010) showed that using a multi-sample scan- 
ning algorithm with the sum of the chi-squares statistic gives higher consis- 
tency on these data than single sample analysis, and Figure 5 shows that 
a mixture model, with small values of po, gives an additional improvement. 
Using Po = 0.1 performs noticeably better, and po = 0.01 gives a slight ad- 
ditional improvement. 
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Tocal calls 

Fig. 5. Comparison of results on the Stanford Quality Control Panel data using the 
mixture likelihood ratio statistic (2.6). Each curve is for a different value ofpo. The points 
on the curve refer to different absolute median thresholds (0.2,0.3,0.4) for identifying 
carriers. 



Visual inspection of the data indicates that most CNV regions are car- 
ried by fewer than 10 samples. Thus, the fact that the mixture model with 
Po = 0.01 performs the best is consistent with the power computations in 
Section 4. We also found that the detected CNV region is often quite short. 
In many cases, consistent calls contained fewer than 5 SNPs. 

6. Discussion. Although the scan statistic relies on the unknown mix- 
ture fraction pQ, the power analyses show that it is quite insensitive to miss- 
specification of this parameter within reasonable ranges. Quite generally, 
the power is sensitive to the value of po only when p or pQ is very small. 
In practice, for N = 1000 it seems reasonable to do a separate scan using 
a few different values of po, such as po £ {0.001, 0.01, 0.1, 1}, and then apply 
a Bonferroni correction. One can also use a simple Monte Carlo approxima- 
tion for the marginal power as in Section 4 to find a good range of po to use 
under various conditions. 

From the power analysis in Section 4, where we assume the probe cov- 
erage and signal to noise ratios typical of the Affymetrix 6.0 microarray 
platform (between two and three standard deviations), we showed that the 
proposed method is expected to boost power significantly for detection of 
short CNV regions (<15 SNP coverage). When the signal to noise ratio is 
weaker (around 1 standard deviation), we can expect an improvement in 
power for CNV with less than 60 SNP coverage. This is the range of CNV 
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lengths where the current single sample detection methods fail. In our expe- 
rience such short CNV are the most abundant in the genome, and would be 
the most useful in a variety of studies. Many current genome-wide studies 
simply ignore CNV with less than, say, 10 SNP coverage, since they are 
not reliably detected with standard methods. However, when we pool data 
across samples, the power increases dramatically for the detection of such 
short CNV, even when only a few samples within the cohort are carriers. 

By assessing concordance across replicates and adherence to Mendelian 
inheritance in parent-child trios, we showed in Section 5 that the mixture 
likelihood ratio improves the accuracy of CNV detection, especially when 
the variant is rare. The accurate detection of rare variants makes these vari- 
ants available for genetic association studies and other studies of population 
genetics. 

The analytical approximation to the false positive rate given in Section 3 
is accurate across all ranges of A'' and po that we have tested. It allows 
instantaneous assessment of the false positive rate for genome-wide scans, 
where Monte Carlo methods are computationally infeasible. The theoretical 
framework for the approximation is not limited to Gaussian errors, and can 
be applied to other error models. 

APPENDIX A: INFORMAL DERIVATION OF (3.3) 

The approximation (3.3) is obtained using a general method for comput- 
ing first passage probabilities first introduced in Yakir and Pollack (Yakir 
and Pollak, 1998) and further developed in Siegmund and Yakir (Siegmund 
and Yakir, 2000) and Siegmund, Yakir and Zhang (Siegmund, Yakir and 
Zhang, 2010). The method relies on measure transformations that shift the 
distribution for each sequence over the scan window. We use the notation of 
Section 3. We omit some of the technical details needed to make the deriva- 
tion rigorous. These details have been described and proved in Siegmund, 
Yakir and Zhang (Siegmund, Yakir and Zhang, 2010). 

Recall the definition ipriO) = logEexp{0(7(Z^)}, where Zr is a generic 
standardized sum over all observations within a given window of size r in 
one sample, and the parameter 6 = 6^ is selected by solving the equation 
Ntpr{0) = X. Since is a standardized sum of r independent random vari- 
ables, ipr converges to a limit as r — ?• oo, and Oj- converges to a limiting 
value. We denote this limiting value by 6. The transformed distribution for 
all sequences at a fixed start position s and window size r is denoted by P^ 
and is defined via 

dPl = exp[0^G( JJ) - NiPr{Or)] dP. 

Let ^Ar(JJ) = log(dPJ/dP). Let D = {(s,t) :0 < s <T,To<t <Ti} be the 
set of all possible windows in the scan. Let A = {max(j, G{Jg) > x} be 
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= E E exp[^^(Jj;)]] 
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(A.l) 



(s,T)eD 

E 

(s,t)gD 

X e: 



„<?jv(JJ)-fiv(JJ) 



7VKV'r(e,)-5/'r(er)} 



(s,T)eZ) 



exp 



-^'iv(JJ)-logA/iv(JJ).^ 



where 



N 



i=l 

t,u L i=l ) 



N 



MM) = maxexp|^e„[5(Zi(J,")) - g{Zi{Jl))\ | . 

Since s and r are fixed in much of what foUows, we sometimes suppress 
the dependence of the above notation on JJ and simply write ijy, Sn, Mjy 
for i]y{Jg), Siy{Jg), and M]y{Jg), respectively. As explained in Siegmund, 
Yakir and Zhang (2010), under certain verifiable assumptions, a "locahzation 
lemma" allows simplifying the quantities of the form 

(A.2) El[{MN/SN)e-'^-'°^''^;iN + logM^v > 0] 

into much simpler expressions of the form 
(A.3) a^;,(2^)-i/2E[M/5], 

where (77v,t is the standard deviation of In and E[M/S] is the limit of 
E[M7v/«S'7v] as N ^ oo. This reduction relies on the fact that, for large 
and T, the "local" processes M^v and Sn are approximately independent 
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of the "global" process (.n- This allows the expectation in (A. 2) to be de- 
composed into the expectation of Mj^/Sn times the expectation involving 
£n + log Mat, treating log Mat essentially as a constant. 

We next analyze each of the terms in (A. 3) separately. First consider the 
processes Mjy and Sjy. The difference between standardized sums can be 
written in the form 

ujn - = z,{jr) - u-'/'w,{j:) + u-'/'w,{j:) - z^m 
= u-'/\w,{jn - Wiij:)) + jj)[(t/^x)V2 _ 

By taking a Taylor expansion of order two and keeping only the mean 
zero stochastic terms of order 0(iV~^/^) and deterministic terms of order 
0{N-^), we obtain 

+ Z,{Jl)g{Z,{Jl))r - 



2u 

+ ^ ( 2^ 2^ y^.o]■ 



It follows that 

N 



(A.4) YeA9{z.{J'n)-9{z^{J^)]^ E E ^7 



for 



"7 = =^(''-''''h(^dJ:))vJ 



Observe that one may substitute r for u and 6 = lim^_j.oo Or for Or in the 
definition of the increments and still maintain the required level of accuracy. 
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Consider the random variable -f/^ . Its first component has mean zero 
under the distribution determined by PJ, since the random variables Uij, 1 < 
i < N, are not in the interval JJ. By the central limit theorem, Hj~ converges 
to a normal random variable with variance that is approximately equal to 

varI[^+]«^2^varI(5(Zi(j;))yi,,)«^2^Ee[5(Z)2], 

with the distribution of the random variable Z given by a density propor- 
tional to (p{z)e^^^^\ for 9 the limit of 9^- The second component converges 
by the law of large numbers to 

E;[^+] « ^E,[ff(Z) - g{Z)Z] = -^9'^Eg[{g{Z))% 
where the last equation follows from integration of the identity 

^J^(z)g{z)e'3(^^] = [-zg{z) + g{z) + 9{g{z)fMz)e'a(-^ dz. 

Regarding the random variable , note that due to the sufficiency of 
the statistic Zi{Jl) and the exchangeability of the observations that form it 
under the null distribution, we get that the conditional expectation of y^j, 
given Zj(JJ), equals Zj(JJ)/-y/r. Straightforward computations, that essen- 
tially repeat those carried out for , show that 

K[HJ]^-\9'^M{9{Z)?]- 

For the variance of this term, since one can ignore o(l) quantities, we should 
approximate the expectation 

WA[g{Zr{Jl))fyl,}- 

But, if we denote hy Z = Zi{JJ \ {j}) the standardized sum of all the ob- 
servations in the first row excluding yij, and denote by the expectation 
with respect to the measure where g{Z) is used for the exponential change 
of measure, we get a negligible difference between the original expectation 
and 

El{[5(^)]'2/L} = El{[ff(^)]'} 

The difference is negligible due to the facts that the function h{z, 9) = 
[g{^z)\e^^^^^ is continuous with respect to both z and 9 and that Vr con- 
verges, as r — )■ oo to a continuous limit. The conclusion is that both types of 
increments converge to the same limiting normal distribution, with a mean 
value equal to minus one half the variance. 
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One may use the same technique in order to show that the covariance 
between any two increments is of the order of 0{1/N). 
The process 1^ has mean and variance 

cr^^^ = var^(^Ar) 

(A.5) =Neli:r{er) 

= N9lv^Tl{g{Z,{Jl)) 

and its covariance with an increment of the local process is of order N~^/'^, 
so asymptotically the two are independent. 

It follows from these calculations that the two local processes in (A. 4) 
which arise from perturbations at the endpoints of the interval (s, s + r] are 
asymptotically independent two-sided random walks. The increments are 
independent, identically distributed normal random variables. Moreover, in- 
tegrating by parts the analytic expression for El[g{Zi^s)], one sees that the 
absolute value of the mean of the local process equals half the variance. 
The random variables and Sn are respectively the maximum and sum 
of these local processes. Consequently, following Siegmund and Yakir (Sieg- 
mund and Yakir, 2000), we get that 

(A.6) E[M/S] = [{N/rMe)u{[2{N/rMep')f, 

where 

^,{e) = ^\e[{g{Z)}'] 

Combining (A.6) with (A.5) in (A. 3), and then substituting the result for 
the expectations in (A.l) yields (3.3). 

APPENDIX B: ANOTHER NUMERICAL EXAMPLE 

The numerical example discussed in Section 3.1 was limited to relatively 
small values of T by the extremely time consuming nature of the simulations. 
Here we give a somewhat different example where it is computationally 
feasible to consider larger T, since the scan statistic involves only a one- 
dimensional maximization. 

The statistic is 

TV 

(B.l) max Vlog[l-po+Poexp(C/2(jA)/2)], 

0<jA<£^ 

where the processes Ui{t) are independent stationary Ornstein-Uhlenbeck 
proc-esses with covariance function cov[Ui{s),Ui{t)] =exp{—(3\t — s\). This 
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Table 2 

Accuracy of approximate thresholds: The statistic is the mixture 

likelihood ratio for linkage, with parameters 
N = 1000, 1 = 1600, A = 1, /3 = 0.02. The number of repetitions of 
the Monte Carlo experiment is 1000 



Po 


Significance level 


Th (approx.) 


Th (MC) 


0.02 


0.10 


47.0 


47.5 


0.02 


0.05 


48.5 


48.9 


0.02 


0.01 


51.3 


51.8 


0.01 


0.10 


30.1 


29.2 


0.01 


0.05 


31.3 


31.5 


0.01 


0.01 


33.6 


33.8 



statistic would be reasonable as an approximation in a linkage study of the 
expression levels of N genes, regarded as quantitative traits (eQTL), when 
one is particularly interested in "master regulators," that is, genomic regions 
that control the expression levels of a collection of genes. See Siegmund 
and Yakir (Siegmund and Yakir, 2007) for a general discussion of linkage 
analysis and Morley et al. (Morley, 2004), Goring et al. (Goring et al., 2007) 
and Shi, Siegmund and Levinson (Shi, Siegmund and Levinson, 2007) for 
recent studies of linkage for eQTL and discussions of the existence of master 
regulators. In this case i is the length of the genome in centimorgans (taken 
here to be 1600, the approximate length of a mouse genome), A is the 
(average) genetic distance between markers, and /3 = 0.02 for a backcross or 
for the statistic associated with the additive effect of an intercross. Table 2 
gives numerical results for an approximation to the tail probability of (B.l), 
which was suggested by Siegmund, Yakir and Zhang (2010) and is analogous 
to (3.3), but is much simpler to derive. This approximation is also quite 
accurate. 

REFERENCES 

BiGNELL, G. R., Huang, J., Greshock, J., Watt, S., Butler, A., West, S., Grig- 
OROVA, M., Jones, K. W., Wei, W., Stratton, M. R., Futreal, P. A., Weber, B., 
Shapero, M. H. and Wooster, R. (2004). High-resolution analysis of DNA copy num- 
ber using oligonucleotide microarrays. Genome Res. 14 287-295. 

Colella, S., Yau, C., Taylor, J. M., Mirza, G., Butler, H., Clouston, P., Bas- 
SETT, A. S., Seller, A., Holmes, C. C. and Ragoussis, J. (2007). QuantiSNP: 
An objective Bayes hidden-Markov model to detect and accurately map copy number 
variation using SNP genotyping data. Nucl. Acids Res. 35 2013-2025. 

DiSKiN, S. J., Li, M., Hou, C., Yang, S., Glessner, J., Hakonarson, H., Bucan, M., 
Maris, J. M. and Wang, K. (2008). Adjustment of genomic waves in signal intensities 
from whole-genome SNP genotyping platforms. Nucl. Acids Res. 36 el26-|-. 

Goring, H. H., Curran, J. E., Johnson, M. P., Dyer, T. D., Charlesworth, J., 
Cole, S. A., Jowett, J. B. M., Abraham, L. J., Rainwater, D. L., Co- 



24 



D. SIEGMUND, B. YAKIR AND N. R. ZHANG 



MuzziE, A. G., Mahaney, M. C, Almasy, L., MacCluer, J. W., Kissebah, A. H., 
Collier, G. R., Moses, E. K. and Blangero, J. (2007). Discovery of expression 
QTLs using large-scale transcriptional profiling in human lymphocytes. Nat. Genet. 39 
f2Q8-f2f6. 

Lai, W. R., Johnson, M. D., Kucherlapati, R. and Park, P. J. (2005). Comparative 
analysis of algorithms for identifying amplifications and deletions in array CGH data. 
Bioinformatics 21 3763-3770. 

McCarroll, S. a. (2008). Extending genome-wide association studies to copy-number 
variation. Hum. Mol. Genet. 17 R135-R142. 

McCarroll, S. A. A., Kuruvilla, F. G. G., Korn, J. M. M., Cawley, S., Nemesh, J., 
Wysoker, a., Shapero, M. H. H., de Barker, P. I. W. I., Maller, J. B. B., 
KiRBY, A., Elliott, A. L. L., Parkin, M., Hubbell, E., Webster, T., Mei, R., 
Veitch, J., Collins, P. J. J., Handsaker, R., Lincoln, S., Nizzari, M., Blume, J., 
Jones, K. W. W., Rava, R., Daly, M. J. J., Gabriel, S. B. B. and Altshuler, D. 
(2008). Integrated detection and population-genetic analysis of SNPs and copy number 
variation. Nat. Genet. 40 1166-1174. 

MORLEY, M., MOLONY, C. M., TERESA, M., WeBER, T. M., DeVLIN, J. L., 

EwENS, W. G., Spielman, R. S. and Cheung, V. G. (2004). Genetic analysis of 
genome- wide variation in human gene expression. Nature 430 743-747. 
Olshen, a. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular bi- 
nary segmentation for the analysis of array-based DNA copy number data. Biostatistics 
5 557-572. 

Peiffer, D. a., Le, J. M., Steemers, F. J., Chang, W., Jenniges, T., Garcia, F., 
Haden, K., Li, J., Shaw, C. A., Belmont, J., Cheung, S. W., Shen, R. M., 
Barker, D. L. and Gunderson, K. L. (2006). High-resolution genomic profiling of 
chromosomal aberrations using infinium whole-genome genotyping. Genome Res. 16 
1136-1148. 

Pinkel, D. and Albertson, D. G. (2005). Array comparative genomic hybridization and 
its applications in cancer. Nat. Genet. 37 S11-S17. 

Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, L, Kowbel, D., Collins, C, 
Kuo, W. L., Chen, C, Zhai, Y., Dairkee, S. H., Ljung, B. M., Gray, J. W. and 
Albertson, D. G. (1998). High resolution analysis of DNA copy number variation 
using comparative genomic hybridization to microarrays. Nat. Genet. 20 207-211. 

Pollack, J. R., Perou, C. M., Alizadeh, A. A., Eisen, M. B., Pergamenschikov, A., 
Williams, C. F., Jeffrey, S. S., Botstein, D. and Brown, P. O. (1999). Genome- 
wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet. 23 
41-46. 

PURDOM, E. and Holmes, S. P. (2005). Error distribution for gene expression data. 
Statist. Appl. Genet. Mol. Biol. 4 16. MR2170432 

Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, D. T., 
Fiegler, H., Shapero, M. H., Carson, A. R., Chen, W., Cho, E. K., Dal- 
LAiRE, S., Freeman, J. L., Gonzalez, J. R., Gratacos, M., Huang, J., Kalait- 
ZOPOULOS, D., KoMURA, D., Macdonald, J. R., Marshall, C. R., Mei, R., 
Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M. J., 
TcHiNDA, J., Valsesia, A., WooDWARK, C, Yang, F., Zhang, J., Zerjal, T., 
Zhang, J., Armengol, L., Conrad, D. F., Estivill, X., Tyler-Smith, C, 
Carter, N. P., Aburatani, H., Lee, C, Jones, K. W., Scherer, S. W. and 
Hurles, M. E. (2006). Global variation in copy number in the human genome. Nature 
444 444-454. 



SIMULTANEOUS VARIANT INTERVAL DETECTION 



25 



Shi, J., SlEGMUND, D. and Levinson, D. F. (2007). Statistical corrections of linkage data 

suggest predominantly cis regulations of gene expression. In Proceedings of the 2006 

Genetic Analysis Workshop, BMCC Proceedings I S145. 
SlEGMUND, D. O. and Yakir, B. (2000). Tail probabilities for the null distribution of 

scanning statistics. Bernoulli 6 191-213. MR1748719 
SlEGMUND, D. O. and Yakir, B. (2007). The Statistics of Gene Mapping. Springer, New 

York. MR2301277 

SlEGMUND, D. O., Yakir, B. and Zhang, N. R. (2010). Tail approximations for maxima 
of random fields by likelihood ratio transformations. Sequential Anal. 29 245-262. 

Snijders, a. M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., 
Hamilton, G., Kindle, A. K., Huey, B., Kimura, K., Law, S., Myambo, K., 
Palmer, J., Ylstra, B., Yue, ,1. P., Gray, ,1. W., Jain, A. N., Pinkel, D. and 
Albertson, D. G. (2001). Assembly of microarrays for genome-wide measurement of 
DNA copy number. Nat. Genet. 29 263-264. 

Tartakovsky, a. and Polunchenko, A. S. (2007). Decentralized quickest change de- 
tection in distributed sensor systems with applications to information assurance and 
counter terrorism. In Proceedings of the 13th Annual Army Gonference on Applied 
Statistics, Houston, TX. 

Walsh, T., McClellan, J. M., McCarthy, S. E., Addington, A. M., Pierce, S. B., 
Cooper, G. M., Nord, A. S., Kusenda, M., Malhotra, D., Bhandari, A., 
Stray, S. M., Rippey, C. P., Roccanova, P., Makarov, V., Lakshmi, B., Fin- 
dling, R. L., Sikich, L., Stromberg, T., Merriman, B., Gogtay, N., But- 
ler, P., Eckstrand, K., Noory, L., Gochman, P., Long, R., Chen, Z., Davis, S., 
Baker, C, Eichler, E. E., Meltzer, P. S., Nelson, S. F., Singleton, A. B., 
Lee, M. K., Rapoport, J. L., King, M.-C. and Sebat, J. (2008). Rare structural 
variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Sci- 
ence 320 539-543. 

Wang, K., Li, M., Hadley, D., Liu, R., Glessner, ,L, Grant, S. F. A., Hakonar- 
SON, H. and BuCAN, M. (2007). PennCNV: An integrated hidden Markov model de- 
signed for high-resolution copy number variation detection in whole-genome SNP geno- 
typing data. Genome Res. 17 1665-1674. 

WiLLENBROCK, H. and Fridlyand, J. (2005). A comparison study: Applying segmenta- 
tion to arrayCGH data for downstream analyses. Bioinformatics 21 4084-4091. 

Yakir, B. and Pollak, M. (1998). A new representation for a renewal-theoretic constant 
appearing in asymptotic approximations of large deviations. Ann. Appl. Probab. 8 749- 
774. MR1627779 

Zhang, N. R. (2010). DNA copy number profiling in normal and tumor genomes. In 

Frontiers m Gomputational and Systems Biology (J. Feng, W. Fu and F. Sun, eds.) 

259-281. Springer, London. 
Zhang, N. R., Senbabaoglu, Y. and Li, J. Z. (2010). Joint estimation of DNA copy 

number from multiple platforms. Bioinformatics 26 153-160. 
Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with 

applications to the analysis of comparative genomic hybridization data. Biometrics 63 

22-32. MR2345571 

Zhang, N. R., Siegmund, D. O., Ji, H. and Li, J. Z. (2010). Detecting simultaneous 
change-points in multiple sequences. Biometrika. 97 631-644. 



26 



D. SIEGMUND, B. YAKIR AND N. R. ZHANG 



D. SlEGMUND 

N. R. Zhang 

Department of Statistics 
Stanford University 
Sequoia Hall 
390 Serra Mall 

Stanford, California 94305-4065 
USA 

E-MAIL: dosCSstat. stanford.edu 
nzhang@stanford.odu 



B. Yakir 

Department of Statistics 
Hebrew University of Jerusalem 
Jerusalem 91905 
Israel 

E-mail: msbyOmscc. huji.ac.il 



